Take-home Ex03

Author

Ryan Peh

Published

October 31, 2024

Modified

November 3, 2024

Overview and Objectives

In this take-home my aim is to prototype and evaluate the visualisation of pages and components in our Shiny application.

My responsibilities will be performing Geographically Weighted Regression (GWR) on the districts of Malaysia.

Data

Packages

  • olsrr: Provides tools for building and validating OLS regression models with stepwise selection and diagnostics.
  • ggstatsplot: Extends ggplot2 with statistical tests and data visualization in a single, user-friendly syntax.

  • sf: Provides a standardized way to work with spatial vector data (points, lines, polygons).

  • tmap: Creates thematic maps with interactive and static modes for spatial visualization.

  • tidyverse: A collection of packages for easy data manipulation, visualization, and analysis.

  • performance: Offers tools to assess, validate, and compare regression models.

  • see: Supplies visual themes and color palettes to enhance ggplot2 visualizations.

  • sfdep: Provides spatial dependency analysis tools specifically for sf objects.

  • GWmodel: Implements geographically weighted regression (GWR) and spatial analysis methods.

  • lubridate: Simplifies the handling and manipulation of dates and times.

pacman::p_load(olsrr, ggstatsplot, sf, tmap, tidyverse, performance, see, sfdep, GWmodel, lubridate)

Data

Aspatial

crime <- read_csv("data/aspatial/crime_district.csv")
Rows: 19152 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): state, district, category, type
dbl  (1): crimes
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
inequality <- read_csv("data/aspatial/inequality_district.csv")
Rows: 318 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): state, district
dbl  (1): gini
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
income <- read_csv("data/aspatial/income_district.csv")
Rows: 318 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): state, district
dbl  (2): income_mean, income_median
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
poverty <- read_csv("data/aspatial/poverty_district.csv")
Rows: 318 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): state, district
dbl  (2): poverty_absolute, poverty_relative
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Geospatial

msia = st_read(dsn = "data/geospatial", layer = "mys_admbnda_adm2_unhcr_20210211")
Reading layer `mys_admbnda_adm2_unhcr_20210211' from data source 
  `C:\ryanpxp\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 144 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 99.64072 ymin: 0.855001 xmax: 119.2697 ymax: 7.380556
Geodetic CRS:  WGS 84

Data Wrangling

st_crs(msia)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
msia <- msia %>%
  st_transform(crs = 3168)
st_crs(msia)
Coordinate Reference System:
  User input: EPSG:3168 
  wkt:
PROJCRS["Kertau (RSO) / RSO Malaya (m)",
    BASEGEOGCRS["Kertau (RSO)",
        DATUM["Kertau (RSO)",
            ELLIPSOID["Everest 1830 (RSO 1969)",6377295.664,300.8017,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4751]],
    CONVERSION["Rectified Skew Orthomorphic Malaya Grid (metre)",
        METHOD["Hotine Oblique Mercator (variant A)",
            ID["EPSG",9812]],
        PARAMETER["Latitude of projection centre",4,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8811]],
        PARAMETER["Longitude of projection centre",102.25,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8812]],
        PARAMETER["Azimuth of initial line",323.0257905,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8813]],
        PARAMETER["Angle from Rectified to Skew Grid",323.130102361111,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8814]],
        PARAMETER["Scale factor on initial line",0.99984,
            SCALEUNIT["unity",1],
            ID["EPSG",8815]],
        PARAMETER["False easting",804670.24,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["Malaysia - West Malaysia onshore."],
        BBOX[1.21,99.59,6.72,104.6]],
    ID["EPSG",3168]]
str(msia)
Classes 'sf' and 'data.frame':  144 obs. of  15 variables:
 $ ADM2_EN   : chr  "Batu Pahat" "Johor Bahru" "Kluang" "Kota Tinggi" ...
 $ ADM2_PCODE: chr  "MY0101" "MY0102" "MY0103" "MY0104" ...
 $ ADM2_REF  : chr  NA NA NA NA ...
 $ ADM2ALT1EN: chr  NA NA NA NA ...
 $ ADM2ALT2EN: chr  NA NA NA NA ...
 $ ADM1_EN   : chr  "Johor" "Johor" "Johor" "Johor" ...
 $ ADM1_PCODE: chr  "MY01" "MY01" "MY01" "MY01" ...
 $ ADM0_EN   : chr  "Malaysia" "Malaysia" "Malaysia" "Malaysia" ...
 $ ADM0_PCODE: chr  "MY" "MY" "MY" "MY" ...
 $ date      : Date, format: "2020-12-02" "2020-12-02" ...
 $ validOn   : Date, format: "2021-02-11" "2021-02-11" ...
 $ validTo   : Date, format: "-001-11-30" "-001-11-30" ...
 $ Shape_Leng: num  1.86 1.83 2.36 3.15 1.07 ...
 $ Shape_Area: num  0.161 0.0802 0.2336 0.2783 0.0615 ...
 $ geometry  :sfc_MULTIPOLYGON of length 144; first list element: List of 1
  ..$ :List of 1
  .. ..$ : num [1:1140, 1:2] 556715 556665 556609 556548 556517 ...
  ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
 - attr(*, "sf_column")= chr "geometry"
 - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
  ..- attr(*, "names")= chr [1:14] "ADM2_EN" "ADM2_PCODE" "ADM2_REF" "ADM2ALT1EN" ...
u_msia <- st_union(msia)
plot(u_msia)

crime
# A tibble: 19,152 × 6
   state    district category type           date       crimes
   <chr>    <chr>    <chr>    <chr>          <date>      <dbl>
 1 Malaysia All      assault  all            2016-01-01  22327
 2 Malaysia All      assault  all            2017-01-01  21366
 3 Malaysia All      assault  all            2018-01-01  16902
 4 Malaysia All      assault  all            2019-01-01  16489
 5 Malaysia All      assault  all            2020-01-01  13279
 6 Malaysia All      assault  all            2021-01-01  11495
 7 Malaysia All      assault  all            2022-01-01  10348
 8 Malaysia All      assault  all            2023-01-01  10453
 9 Malaysia All      assault  causing_injury 2016-01-01   5531
10 Malaysia All      assault  causing_injury 2017-01-01   5024
# ℹ 19,142 more rows

looking at the crime object we can see that “All” fields is included

filter out columns where district is “All”

crime_filtered <- crime %>% filter(district != "All")

Convert the date column to year format and removed old date

keep only 2019 and 2022 to match the other datasets

crime_filtered <- crime_filtered %>%
  mutate(year = year(date))%>%
  select(-date) %>%
  filter(year %in% c(2019, 2022))         
income <- income %>%
  mutate(year = year(date))%>%
  select(-date)           
poverty <- poverty %>%
  mutate(year = year(date))%>%
  select(-date)           
inequality <- inequality %>%
  mutate(year = year(date))%>%
  select(-date)           
na <- crime_filtered %>%
  summarise(na_district = sum(is.na(district)),
            na_category = sum(is.na(category)),
            na_type = sum(is.na(type)),
            na_date = sum(is.na(date)),
            na_crimes = sum(is.na(crimes))
            )
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `na_date = sum(is.na(date))`.
Caused by warning in `is.na()`:
! is.na() applied to non-(list or vector) of type 'closure'
print(na)
# A tibble: 1 × 5
  na_district na_category na_type na_date na_crimes
        <int>       <int>   <int>   <int>     <int>
1           0           0       0       0         0
na <- inequality %>%
  summarise(na_district = sum(is.na(district)),
            na_date = sum(is.na(date)),
            na_gini = sum(is.na(gini))
            )
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `na_date = sum(is.na(date))`.
Caused by warning in `is.na()`:
! is.na() applied to non-(list or vector) of type 'closure'
print(na)
# A tibble: 1 × 3
  na_district na_date na_gini
        <int>   <int>   <int>
1           0       0       0

< do same for other dataset >

function for renaming

rename_districts <- function(data) {
  data <- data %>%
    mutate(district = case_when(
      district %in% c("Iskandar Puteri", "Nusajaya", "Johor Bahru Selatan", "Johor Bahru Utara", "Seri Alam") ~ "Johor Bahru",
      district == "Bandar Bharu" ~ "Bandar Baharu",
      district %in% c("Brickfields", "Cheras", "Dang Wangi", "Sentul", "Wangsa Maju","W.P. Kuala Lumpur") ~ "WP. Kuala Lumpur",
      district == "Nilai" ~ "Seremban",
      district == "Cameron Highland" ~ "Cameron Highlands",
      district == "Kuala Lipis" ~ "Lipis",
      district %in% c("Batu Gajah", "Ipoh") ~ "Kinta",
      district == "Gerik" ~ "Ulu Perak",
      district == "Manjung" ~ "Manjung (Dinding)",
      district == "Pangkalan Hulu" ~ "Ulu Perak",
      district %in% c("Selama", "Taiping", "Larut dan Matang") ~ "Larut Dan Matang",
      district == "Sungai Siput" ~ "Kuala Kangsar",
      district %in% c("Tanjong Malim", "Tapah", "Bagan Datuk", "Muallim") ~ "Batang Padang",
      district %in% c("Arau", "Kangar", "Padang Besar") ~ "Perlis",
      state == "Pulau Pinang" & district == "Seberang Perai Selatan" ~ "S.P.Selatan",
      district == "Seberang Perai Tengah" ~ "S.P. Tengah",
      district == "Seberang Perai Utara" ~ "S.P. Utara",
      district == "Ampang Jaya" ~ "Gombak",
      district == "Kajang" ~ "Ulu Langat",
      district %in% c("Pengkalan Hulu","Hulu Perak") ~ "Ulu Perak",
      district == "Hulu Selangor" ~ "Ulu Selangor",
      district %in% c("Klang Selatan", "Klang Utara") ~ "Klang",
      district %in% c("Petaling Jaya", "Serdang", "Sg. Buloh", "Shah Alam", "Subang Jaya", "Sungai Buloh") ~ "Petaling",
      district == "Kota Kinabatangan" ~ "Kinabatangan",
      district == "Kota Samarahan" ~ "Samarahan",
      district %in% c("Matu Daro", "Tanjung Manis") ~ "Mukah",
      district == "Padawan" ~ "Kuching",
      district == "Kulai" ~ "Kulaijaya",
      district == "Tangkak" ~ "Ledang",
      district == "Kecil Lojing" ~ "Gua Musang",
      district == "Kalabakan" ~ "Tawau",
      district == "Telupid" ~ "Beluran",
      district == "Beluru" ~ "Miri",
      district == "Bukit Mabong" ~ "Kapit",
      district == "Kabong" ~ "Saratok",
      district == "Maradong" ~ "Meradong",
      district == "Pusa" ~ "Betong",
      district == "Sebauh" ~ "Bintulu",
      district == "Subis" ~ "Miri",
      district == "Tebedu" ~ "Serian",
      district == "Telang Usan" ~ "Marudi",
      district == "Kuala Nerus" ~ "Kuala Terengganu",
      TRUE ~ district
    ))
  
  return(data)
}
crime_filtered <- rename_districts(crime_filtered)
crime_filtered <- crime_filtered %>%
  group_by(state, district, category, type, year) %>% 
  summarize(crimes = sum(crimes), .groups = 'drop')

Unique values in crime_filtered\(district that are not in msia\)ADM2_EN

geospatial_missing <- setdiff(unique(crime_filtered$district), unique(msia$ADM2_EN))

print(geospatial_missing)
character(0)
geospatial_missing <- setdiff(unique(income$district), unique(msia$ADM2_EN))

print(geospatial_missing)
 [1] "Kulai"                  "Tangkak"                "Kecil Lojing"          
 [4] "Bagan Datuk"            "Hulu Perak"             "Larut dan Matang"      
 [7] "Manjung"                "Muallim"                "Selama"                
[10] "Seberang Perai Selatan" "Seberang Perai Tengah"  "Seberang Perai Utara"  
[13] "Kalabakan"              "Telupid"                "Beluru"                
[16] "Bukit Mabong"           "Kabong"                 "Maradong"              
[19] "Pusa"                   "Sebauh"                 "Subis"                 
[22] "Tanjung Manis"          "Tebedu"                 "Telang Usan"           
[25] "Kuala Nerus"            "W.P. Kuala Lumpur"     
income <- rename_districts(income)
income <- income %>%
  group_by(state, district, year) %>% 
  summarize(
    income_mean = sum(income_mean),
    income_median = sum(income_median),
    .groups = 'drop'
  )
inequality <- rename_districts(inequality)
inequality <- inequality %>%
  group_by(state, district, year) %>% 
  summarize(
    inequality = sum(gini),
    .groups = 'drop'
  )
geospatial_missing <- setdiff(unique(inequality$district), unique(msia$ADM2_EN))

print(geospatial_missing)
character(0)
poverty <- rename_districts(poverty)
poverty <- poverty %>%
  group_by(state, district, year) %>% 
  summarize(
    poverty_relative = sum(poverty_relative),
    poverty_absolute = sum(poverty_absolute),
    .groups = 'drop'
  )
geospatial_missing <- setdiff(unique(poverty$district), unique(msia$ADM2_EN))

print(geospatial_missing)
character(0)

Start by joining crime with poverty, then add inequality and income

combined_data <- crime_filtered %>%
  left_join(poverty, by = c("state", "district", "year")) %>%
  left_join(inequality, by = c("state", "district", "year")) %>%
  left_join(income, by = c("state", "district", "year"))
msia_geometry <- msia %>%
  select(1, 13:15)
combined_data <- combined_data %>%
  left_join(msia_geometry, by = c("district" = "ADM2_EN"))

user can choose the type of crime data

combined_data_murder <- combined_data %>%
  filter(type == "murder")
combined_data_injury <- combined_data %>%
  filter(type == "causing_injury")
combined_data_all <- combined_data %>%
  filter(type == "all")
ggcorrmat(combined_data_murder[, 6:11])

ggcorrmat(combined_data_injury[, 6:11])

ggcorrmat(combined_data_all[, 6:11], sig.level = 0.05)

income median and mean are highly correlated since they are really similar and came from the same data set. Might be great for either of the variable to be removed for GWR.

function for generating regression model

user is able to select which variables and data set.

run_regression <- function(data, response, predictors) {
  # Create formula from response and predictors
  formula <- as.formula(
    paste(response, "~", paste(predictors, collapse = " + "))
  )
  
  # Run the linear model
  model <- lm(formula = formula, data = data)
  
  # Return the model
  return(model)
}
# Define predictors as a vector of variable names
predictors <- c("poverty_relative", "poverty_absolute", "inequality", "income_mean", "income_median")

# Run the function with the specified data and predictors
murder_model <- run_regression(
  data = combined_data_murder,
  response = "crimes",
  predictors = predictors
)

ols_regress(murder_model)
                         Model Summary                           
----------------------------------------------------------------
R                       0.695       RMSE                  2.760 
R-Squared               0.483       MSE                   7.797 
Adj. R-Squared          0.472       Coef. Var           132.240 
Pred R-Squared          0.444       AIC                1279.749 
MAE                     1.805       SBC                1304.674 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares         DF    Mean Square      F         Sig. 
---------------------------------------------------------------------
Regression    1847.330          5        369.466    47.386    0.0000 
Residual      1980.435        254          7.797                     
Total         3827.765        259                                    
---------------------------------------------------------------------

                                      Parameter Estimates                                        
------------------------------------------------------------------------------------------------
           model       Beta    Std. Error    Std. Beta      t        Sig       lower      upper 
------------------------------------------------------------------------------------------------
     (Intercept)      0.165         0.582                  0.283    0.777     -0.981      1.310 
poverty_relative      0.029         0.020        0.104     1.509    0.133     -0.009      0.068 
poverty_absolute      0.018         0.024        0.057     0.757    0.450     -0.029      0.065 
      inequality    -22.041         3.030       -0.742    -7.274    0.000    -28.008    -16.073 
     income_mean      0.005         0.000        3.634    10.779    0.000      0.004      0.006 
   income_median     -0.005         0.001       -2.757    -8.686    0.000     -0.006     -0.004 
------------------------------------------------------------------------------------------------
# Define predictors as a vector of variable names
predictors <- c("poverty_relative", "poverty_absolute", "inequality", "income_mean", "income_median")

# Run the function with the specified data and predictors
all_model <- run_regression(
  data = combined_data_all,
  response = "crimes",
  predictors = predictors
)

ols_regress(all_model)
                           Model Summary                             
--------------------------------------------------------------------
R                         0.573       RMSE                  534.279 
R-Squared                 0.328       MSE                288786.035 
Adj. R-Squared            0.321       Coef. Var             209.201 
Pred R-Squared            0.293       AIC                  8021.851 
MAE                     261.063       SBC                  8051.627 
--------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                   ANOVA                                    
---------------------------------------------------------------------------
                     Sum of                                                
                    Squares         DF     Mean Square      F         Sig. 
---------------------------------------------------------------------------
Regression     72459258.257          5    14491851.651    50.182    0.0000 
Residual      148436021.866        514      288786.035                     
Total         220895280.123        519                                     
---------------------------------------------------------------------------

                                         Parameter Estimates                                           
------------------------------------------------------------------------------------------------------
           model         Beta    Std. Error    Std. Beta      t        Sig         lower        upper 
------------------------------------------------------------------------------------------------------
     (Intercept)       69.790        79.162                  0.882    0.378      -85.731      225.310 
poverty_relative        4.736         2.654        0.099     1.785    0.075       -0.477        9.950 
poverty_absolute        1.025         3.259        0.019     0.314    0.753       -5.379        7.428 
      inequality    -3251.324       412.351       -0.645    -7.885    0.000    -4061.425    -2441.224 
     income_mean        0.747         0.067        3.000    11.110    0.000        0.615        0.879 
   income_median       -0.683         0.076       -2.287    -8.996    0.000       -0.832       -0.534 
------------------------------------------------------------------------------------------------------

Define the function

run_stepwise_selection <- function(model, direction = "forward", p_val = 0.05, details = FALSE) {
  if (!direction %in% c("forward", "backward", "both")) {
    stop("Invalid direction. Choose from 'forward', 'backward', or 'both'.")
  }
  
  stepwise_model <- switch(
    direction,
    "forward" = ols_step_forward_p(model, p_val = p_val, details = details),
    "backward" = ols_step_backward_p(model, p_val = p_val, details = details),
    "both" = ols_step_both_p(model, p_val = p_val, details = details)
  )
  
  return(stepwise_model)
}

###forward

murder_fw_mlr <- run_stepwise_selection(
  model = murder_model,
  direction = "forward",
  p_val = 0.05,
  details = FALSE
)

print(murder_fw_mlr)

                                Stepwise Summary                                
------------------------------------------------------------------------------
Step    Variable           AIC         SBC        SBIC        R2       Adj. R2 
------------------------------------------------------------------------------
 0      Base Model       1441.080    1448.202    701.849    0.00000    0.00000 
 1      income_mean      1374.046    1384.728    634.769    0.23319    0.23022 
 2      inequality       1343.382    1357.625    604.139    0.32372    0.31846 
 3      income_median    1279.575    1297.379    541.796    0.47494    0.46879 
------------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.689       RMSE                  2.780 
R-Squared               0.475       MSE                   7.851 
Adj. R-Squared          0.469       Coef. Var           132.696 
Pred R-Squared          0.445       AIC                1279.575 
MAE                     1.853       SBC                1297.379 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares         DF    Mean Square      F         Sig. 
---------------------------------------------------------------------
Regression    1817.974          3        605.991    77.189    0.0000 
Residual      2009.792        256          7.851                     
Total         3827.765        259                                    
---------------------------------------------------------------------

                                     Parameter Estimates                                      
---------------------------------------------------------------------------------------------
        model       Beta    Std. Error    Std. Beta      t        Sig       lower      upper 
---------------------------------------------------------------------------------------------
  (Intercept)      0.004         0.575                  0.007    0.995     -1.129      1.137 
  income_mean      0.005         0.000        3.505    10.577    0.000      0.004      0.006 
   inequality    -17.547         1.825       -0.591    -9.616    0.000    -21.141    -13.954 
income_median     -0.005         0.001       -2.710    -8.587    0.000     -0.006     -0.004 
---------------------------------------------------------------------------------------------

backward

murder_bw_mlr <- run_stepwise_selection(
  model = murder_model,
  direction = "backward",
  p_val = 0.05,
  details = FALSE
)

print(murder_bw_mlr)

                                 Stepwise Summary                                  
---------------------------------------------------------------------------------
Step    Variable              AIC         SBC        SBIC        R2       Adj. R2 
---------------------------------------------------------------------------------
 0      Full Model          1279.749    1304.674    542.184    0.48261    0.47243 
 1      poverty_absolute    1278.335    1299.699    540.699    0.48145    0.47331 
 2      poverty_relative    1279.575    1297.379    541.796    0.47494    0.46879 
---------------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.689       RMSE                  2.780 
R-Squared               0.475       MSE                   7.851 
Adj. R-Squared          0.469       Coef. Var           132.696 
Pred R-Squared          0.445       AIC                1279.575 
MAE                     1.853       SBC                1297.379 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares         DF    Mean Square      F         Sig. 
---------------------------------------------------------------------
Regression    1817.974          3        605.991    77.189    0.0000 
Residual      2009.792        256          7.851                     
Total         3827.765        259                                    
---------------------------------------------------------------------

                                     Parameter Estimates                                      
---------------------------------------------------------------------------------------------
        model       Beta    Std. Error    Std. Beta      t        Sig       lower      upper 
---------------------------------------------------------------------------------------------
  (Intercept)      0.004         0.575                  0.007    0.995     -1.129      1.137 
   inequality    -17.547         1.825       -0.591    -9.616    0.000    -21.141    -13.954 
  income_mean      0.005         0.000        3.505    10.577    0.000      0.004      0.006 
income_median     -0.005         0.001       -2.710    -8.587    0.000     -0.006     -0.004 
---------------------------------------------------------------------------------------------

both

murder_sb_mlr <- run_stepwise_selection(
  model = murder_model,
  direction = "both",
  p_val = 0.05,
  details = FALSE
)

print(murder_sb_mlr)

                                   Stepwise Summary                                    
-------------------------------------------------------------------------------------
Step    Variable                  AIC         SBC        SBIC        R2       Adj. R2 
-------------------------------------------------------------------------------------
 0      Base Model              1441.080    1448.202    701.849    0.00000    0.00000 
 1      income_mean (+)         1374.046    1384.728    634.769    0.23319    0.23022 
 2      inequality (+)          1343.382    1357.625    604.139    0.32372    0.31846 
 3      income_median (+)       1279.575    1297.379    541.796    0.47494    0.46879 
 4      poverty_relative (+)    1278.335    1299.699    540.699    0.48145    0.47331 
-------------------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.694       RMSE                  2.763 
R-Squared               0.481       MSE                   7.784 
Adj. R-Squared          0.473       Coef. Var           132.130 
Pred R-Squared          0.448       AIC                1278.335 
MAE                     1.813       SBC                1299.699 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares         DF    Mean Square      F         Sig. 
---------------------------------------------------------------------
Regression    1842.865          4        460.716    59.188    0.0000 
Residual      1984.900        255          7.784                     
Total         3827.765        259                                    
---------------------------------------------------------------------

                                      Parameter Estimates                                        
------------------------------------------------------------------------------------------------
           model       Beta    Std. Error    Std. Beta      t        Sig       lower      upper 
------------------------------------------------------------------------------------------------
     (Intercept)      0.102         0.575                  0.178    0.859     -1.031      1.236 
     income_mean      0.005         0.000        3.623    10.767    0.000      0.004      0.006 
      inequality    -20.859         2.594       -0.702    -8.040    0.000    -25.968    -15.750 
   income_median     -0.005         0.001       -2.774    -8.770    0.000     -0.006     -0.004 
poverty_relative      0.033         0.019        0.119     1.788    0.075     -0.003      0.070 
------------------------------------------------------------------------------------------------
metric <- compare_performance(murder_model, 
                    murder_fw_mlr$model,
                    murder_bw_mlr$model,
                    murder_sb_mlr$model)
metric$Name <- gsub(".*\\\\([a-zA-Z0-9_]+)\\\\, \\\\model\\\\.*", "\\1", metric$Name)
plot(metric)

Function to create a coefficient plot

create_coef_plot <- function(model, conf.level = 0.95) {
  ggcoefstats(model,
              sort = "ascending",
              title = "Estimated Coefficient",
              conf.level = conf.level) +
    theme(text = element_text(size = 16))
}
create_coef_plot(murder_sb_mlr$model, conf.level = 0.95)

Checksfor multicollinearity

out <- plot(check_model(murder_sb_mlr$model, 
                        panel = FALSE))
For confidence bands, please install `qqplotr`.
out[[2]]

normality test

plot(check_normality(murder_sb_mlr$model))
For confidence bands, please install `qqplotr`.

A histogram might be a better approach for user to tell the distribution at a glance instead.

ols_plot_resid_hist(murder_sb_mlr$model)

plot(check_outliers(murder_sb_mlr$model,
                           method = "cook"))

mlr_output <- as.data.frame(murder_sb_mlr$model$residuals) %>%
  rename(`SB_MLR_RES` = `murder_sb_mlr$model$residuals`)

because of empty rows we need to pad “NA”

murder_residual <- data.frame(MLR_RES = rep(NA, nrow(combined_data_murder)))

murder_residual[rownames(mlr_output), "MLR_RES"] <- mlr_output$SB_MLR_RES
combined_data_murder <- cbind(combined_data_murder, 
                        murder_residual)
combined_data_murder_st <- st_as_sf(combined_data_murder)
tm_shape(msia)+
  tmap_options(check.and.fix = TRUE) +
  tm_polygons(alpha = 0.4) +
tm_shape(combined_data_murder_st) +  
  tm_dots(col = "MLR_RES",
          alpha = 0.6,
          style="quantile") 
Variable(s) "MLR_RES" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

plot(check_collinearity(murder_sb_mlr$model)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Variable `Component` is not in your data frame :/

while its good to show the graph above is great to show the collinearity, showing it in a map might be better as shown below

tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(combined_data_murder_st) +
  tm_polygons(col = "MLR_RES", alpha = 0.6)+
  tm_view(set.zoom.limits = c(5, 9))
Variable(s) "MLR_RES" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
tmap_mode("plot")
tmap mode set to plotting
run_regression <- function(data, response, predictors) {
  # Create formula from response and predictors
  formula <- as.formula(
    paste(response, "~", paste(predictors, collapse = " + "))
  )
  
  # Run the linear model
  model <- lm(formula = formula, data = data)
  
  # Return the model
  return(model)
}
# Define predictors as a vector of variable names
predictors <- c("poverty_relative", "poverty_absolute", "inequality", "income_mean", "income_median")

# Run the function with the specified data and predictors
murder_model <- run_regression(
  data = combined_data_murder,
  response = "crimes",
  predictors = predictors
)

ols_regress(murder_model)
                         Model Summary                           
----------------------------------------------------------------
R                       0.695       RMSE                  2.760 
R-Squared               0.483       MSE                   7.797 
Adj. R-Squared          0.472       Coef. Var           132.240 
Pred R-Squared          0.444       AIC                1279.749 
MAE                     1.805       SBC                1304.674 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares         DF    Mean Square      F         Sig. 
---------------------------------------------------------------------
Regression    1847.330          5        369.466    47.386    0.0000 
Residual      1980.435        254          7.797                     
Total         3827.765        259                                    
---------------------------------------------------------------------

                                      Parameter Estimates                                        
------------------------------------------------------------------------------------------------
           model       Beta    Std. Error    Std. Beta      t        Sig       lower      upper 
------------------------------------------------------------------------------------------------
     (Intercept)      0.165         0.582                  0.283    0.777     -0.981      1.310 
poverty_relative      0.029         0.020        0.104     1.509    0.133     -0.009      0.068 
poverty_absolute      0.018         0.024        0.057     0.757    0.450     -0.029      0.065 
      inequality    -22.041         3.030       -0.742    -7.274    0.000    -28.008    -16.073 
     income_mean      0.005         0.000        3.634    10.779    0.000      0.004      0.006 
   income_median     -0.005         0.001       -2.757    -8.686    0.000     -0.006     -0.004 
------------------------------------------------------------------------------------------------

Filter out rows with NA or Inf in the relevant columns

combined_data_murder_st_clean <- combined_data_murder_st %>%
  filter(
    !is.na(crimes) & !is.na(poverty_relative) & !is.na(poverty_absolute) &
    !is.na(inequality) & !is.na(income_mean) & !is.na(income_median) &
    is.finite(crimes) & is.finite(poverty_relative) & is.finite(poverty_absolute) &
    is.finite(inequality) & is.finite(income_mean) & is.finite(income_median)
  )
bw.fixed <- bw.gwr(formula = crimes ~ poverty_relative + poverty_absolute + inequality +
                     income_mean + income_median, 
                   data=combined_data_murder_st_clean, 
                   approach="CV", 
                   kernel="gaussian", 
                   adaptive=FALSE, 
                   longlat=FALSE)
Fixed bandwidth: 1294962 CV score: 2081.87 
Fixed bandwidth: 800490.4 CV score: 2045.571 
Fixed bandwidth: 494890.3 CV score: 2037.256 
Fixed bandwidth: 306019.1 CV score: 2028.107 
Fixed bandwidth: 189290.2 CV score: 2001.025 
Fixed bandwidth: 117147.8 CV score: 1921.412 
Fixed bandwidth: 72561.4 CV score: 1865.845 
Fixed bandwidth: 45005.46 CV score: 2258.148 
Fixed bandwidth: 89591.91 CV score: 1860.898 
Fixed bandwidth: 100117.3 CV score: 1875.523 
Fixed bandwidth: 83086.83 CV score: 1859.626 
Fixed bandwidth: 79066.47 CV score: 1860.707 
Fixed bandwidth: 85571.55 CV score: 1859.599 
Fixed bandwidth: 87107.19 CV score: 1859.875 
Fixed bandwidth: 84622.47 CV score: 1859.545 
Fixed bandwidth: 84035.91 CV score: 1859.552 
Fixed bandwidth: 84984.99 CV score: 1859.556 
Fixed bandwidth: 84398.42 CV score: 1859.544 
Fixed bandwidth: 84259.96 CV score: 1859.546 
Fixed bandwidth: 84484 CV score: 1859.544 
Fixed bandwidth: 84536.89 CV score: 1859.544 
Fixed bandwidth: 84451.31 CV score: 1859.544 
gwr.fixed <- gwr.basic(formula = crimes ~ poverty_relative + poverty_absolute + inequality +
                     income_mean + income_median, 
                       data=combined_data_murder_st_clean, 
                       bw=bw.fixed, 
                       kernel = 'gaussian', 
                       longlat = FALSE)
gwr.fixed
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2024-11-03 05:14:25.654907 
   Call:
   gwr.basic(formula = crimes ~ poverty_relative + poverty_absolute + 
    inequality + income_mean + income_median, data = combined_data_murder_st_clean, 
    bw = bw.fixed, kernel = "gaussian", longlat = FALSE)

   Dependent (y) variable:  crimes
   Independent variables:  poverty_relative poverty_absolute inequality income_mean income_median
   Number of data points: 260
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-5.8826 -1.4242 -0.3438  0.9802 13.9557 

   Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
   (Intercept)       1.647e-01  5.817e-01   0.283    0.777    
   poverty_relative  2.942e-02  1.950e-02   1.509    0.133    
   poverty_absolute  1.813e-02  2.395e-02   0.757    0.450    
   inequality       -2.204e+01  3.030e+00  -7.274 4.33e-12 ***
   income_mean       5.324e-03  4.939e-04  10.779  < 2e-16 ***
   income_median    -4.848e-03  5.581e-04  -8.686 4.65e-16 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 2.792 on 254 degrees of freedom
   Multiple R-squared: 0.4826
   Adjusted R-squared: 0.4724 
   F-statistic: 47.39 on 5 and 254 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 1980.435
   Sigma(hat): 2.770578
   AIC:  1279.749
   AICc:  1280.194
   BIC:  1083.599
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 84484 
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                           Min.     1st Qu.      Median     3rd Qu.    Max.
   Intercept        -47.2950014  -0.7340682  -0.0773994   0.7073472  2.6754
   poverty_relative  -0.2973646  -0.0618064  -0.0199016   0.0046969  0.2697
   poverty_absolute  -0.1670997  -0.0053479   0.0356065   0.0853528  1.0949
   inequality       -32.9982183 -22.0400396 -10.5543811  -2.3699028 88.4842
   income_mean       -0.0018564   0.0022187   0.0034693   0.0064328  0.0077
   income_median     -0.0074840  -0.0060976  -0.0030624  -0.0016581  0.0021
   ************************Diagnostic information*************************
   Number of data points: 260 
   Effective number of parameters (2trace(S) - trace(S'S)): 70.86396 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 189.136 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 1284.53 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 1192.465 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 1190.494 
   Residual sum of squares: 1201.836 
   R-square value:  0.6860216 
   Adjusted R-square value:  0.5677575 

   ***********************************************************************
   Program stops at: 2024-11-03 05:14:26.390273 
combined_data_murder_st_clean_adaptive <- st_as_sf(gwr.fixed$SDF) %>%
  st_transform(crs=3168)
gwr.fixed.output <- as.data.frame(gwr.fixed$SDF)
condo_resale.sf.adaptive <- cbind(combined_data_murder_st_clean, as.matrix(gwr.fixed.output))
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(combined_data_murder_st_clean_adaptive) +
  tm_polygons(col = "Local_R2", alpha = 0.6) +
  tm_view(set.zoom.limits = c(5, 9))
tmap_mode("plot")
tmap mode set to plotting
tmap_mode("view")
tmap mode set to interactive viewing
Local_R2 <- tm_shape(combined_data_murder_st_clean_adaptive) +
  tm_polygons(col = "Local_R2", alpha = 0.6) +
  tm_view(set.zoom.limits = c(5, 9))

Inequality_TV <- tm_shape(combined_data_murder_st_clean_adaptive) +
  tm_polygons(col = "inequality_TV", alpha = 0.6) +
  tm_view(set.zoom.limits = c(5, 9))

tmap_arrange(Local_R2, Inequality_TV, asp = 1, ncol = 2, sync = TRUE)
Variable(s) "inequality_TV" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
tmap_mode("plot")
tmap mode set to plotting